Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(hogql): cohort left join conjoined #19725

Merged
merged 18 commits into from
Jan 16, 2024
Merged

Conversation

Gilbert09
Copy link
Member

@Gilbert09 Gilbert09 commented Jan 11, 2024

Problem

  • We were having an issue with big clients using multiple cohort breakdowns. We'd run multiple sequential queries, one for each cohort breakdown due to the breakdown value not getting returned from the subquery cohort method.
  • Introduced here feat: added trends breakdown by cohort #18028 with a bit more detail on why it was built this way

Changes

  • Adds a new cohort modifier, leftjoin_conjoined. Built very similar to the leftjoin modifier, but instead of having a LEFT JOIN for each cohort breakdown, we instead have a single LEFT JOIN for all cohorts
  • This allows us to return the cohort id as the breakdown value in the trends query when the modifier is turned on
  • The new cohort modifier logic gets ran before the resolver is ran in hogql land. This is because we reference the join in the trends AST and so the left join needs to be added before types get resolved.
  • This PR does not include turning on the modifier in any way. This is to come in a follow-up PR. Question: how do we want to conditionally turn on modifiers?!

How did you test this code?

  • Added unit tests / snapshots of the new cohort resolver

Copy link
Contributor

github-actions bot commented Jan 11, 2024

Size Change: 0 B

Total Size: 2 MB

ℹ️ View Unchanged
Filename Size
frontend/dist/toolbar.js 2 MB

compressed-size-action

@Gilbert09 Gilbert09 changed the title WIP feat(hogql): cohort left join conjoined Jan 12, 2024
@Gilbert09 Gilbert09 requested review from mariusandra and a team January 12, 2024 13:19
@Gilbert09 Gilbert09 marked this pull request as ready for review January 12, 2024 13:20
@posthog-bot
Copy link
Contributor

Hey @Gilbert09! 👋
This pull request seems to contain no description. Please add useful context, rationale, and/or any other information that will help make sense of this change now and in the distant Mars-based future.

@thmsobrmlr thmsobrmlr force-pushed the tom/trends-cohort-breakdowns branch from 8a9b626 to 7db98bb Compare January 15, 2024 12:07
# Conflicts:
#	mypy-baseline.txt
#	posthog/hogql_queries/insights/trends/breakdown.py
#	posthog/hogql_queries/insights/trends/trends_query_runner.py
@thmsobrmlr thmsobrmlr force-pushed the tom/trends-cohort-breakdowns branch from 7db98bb to 9265096 Compare January 15, 2024 12:08
Copy link
Collaborator

@mariusandra mariusandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jumping to a call, dumping thoughts

Comment on lines +29 to +32
if (isinstance(arg.value, int) or isinstance(arg.value, float)) and not isinstance(arg.value, bool):
cohorts = Cohort.objects.filter(id=int(arg.value), team_id=context.team_id).values_list(
"id", "is_static", "name"
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the float? It seems like some fields in the schema didn't come with /** @asType integer */

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole breakdown value type union is pretty messed up. Would love to break it down some more into multiple fields with better typing. I've had to use # type: ignore in too many places due to this

if not isinstance(arg, ast.Constant):
raise HogQLException("IN COHORT only works with constant arguments", node=arg)

if (isinstance(arg.value, int) or isinstance(arg.value, float)) and not isinstance(arg.value, bool):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same float question here

Comment on lines 141 to 151
"""
SELECT person_id AS cohort_person_id, 1 AS matched, cohort_id
FROM static_cohort_people
WHERE {cohort_clause}
UNION ALL
SELECT person_id AS cohort_person_id, 1 AS matched, cohort_id
FROM raw_cohort_people
WHERE {cohort_clause}
GROUP BY cohort_person_id, cohort_id, version
HAVING sum(sign) > 0
""",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we figure out which ones are static and which are dynamic, and remove one side of this union when needed? I think this could save anywhere between a bit and a bunch of CPU cycles.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can check if we're querying just static, just dynamic, or both for sure

Copy link
Collaborator

@mariusandra mariusandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense. I see now as well why multiple joins with an OR wouldn't have really worked. Thinking now, there probably would have been a way to get it working with array joins, but it'd then still make multiple queries to the same cohorts tables, and possibly cause other issues with other aggregations. So LGTM, but I'd still see if there's something to do about the typing int/float typing. Can probably be later as well.

Edit: also there are some failing tests.

FROM cohortpeople
WHERE and(equals(cohortpeople.team_id, 420), in(cohortpeople.cohort_id, [1]))
GROUP BY cohort_person_id, cohortpeople.cohort_id, cohortpeople.version
HAVING ifNull(greater(sum(cohortpeople.sign), 0), 0)) AS __in_cohort ON equals(__in_cohort.cohort_person_id, events.person_id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flyby: cohortpeople.sign is deprecated, shouldn't be used, this doesn't work properly here. Rather, we look at the version, if it's equal to the version on the cohort.

I see sign is still used in a few places for cohort calculations, but I think those places don't face this same issue 🙈 (for various reasons). In querying though, stick to version!

This means somehow passing the current cohort version value to the query generator

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

basically, the problem here is some stale versions of the cohorts might have sum(sign) > 0 as well, which can then lead to extra persons being filtered, iirc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@neilkakkar Cool, so, I need to grab cohort.version and filter cohort_people + removing the sign clause? 👌

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, should work well!

This means you can get rid of the group by too, which should speed things up a lot, just copy this query:

https://github.com/PostHog/posthog/blob/master/posthog/models/cohort/sql.py#L75

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dreamy, have added this in, but need to remove the group by. Thanks!

@Gilbert09 Gilbert09 merged commit 626e880 into master Jan 16, 2024
99 checks passed
@Gilbert09 Gilbert09 deleted the tom/trends-cohort-breakdowns branch January 16, 2024 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants